
Active 3D Shape Reconstruction from Vision and Touch

Neural Information Processing Systems

Humans build 3D understandings of the world through active object exploration, using their senses of vision and touch jointly. However, most recent progress in 3D shape reconstruction has relied on static datasets of limited sensory data such as RGB images, depth maps, or haptic readings, leaving active exploration of the shape largely unstudied. In active touch sensing for 3D reconstruction, the goal is to actively select the tactile readings that maximize the improvement in shape reconstruction accuracy. However, the development of deep learning-based active touch models is largely limited by the lack of frameworks for shape exploration. In this paper, we focus on this problem and introduce a system composed of: 1) a haptic simulator leveraging high spatial resolution vision-based tactile sensors for active touching of 3D objects; 2) a mesh-based 3D shape reconstruction model that relies on tactile or visuotactile signals; and 3) a set of data-driven solutions with either tactile or visuotactile priors to guide the shape exploration. Our framework enables the development of the first fully data-driven solutions to active touch on top of learned models for object understanding. Our experiments show the benefits of such solutions in the task of 3D shape understanding, where our models consistently outperform natural baselines. We provide our framework as a tool to foster future research in this direction.
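The core loop of active touch selection described in the abstract can be sketched as greedy selection: at each step, score every candidate touch pose with a learned prior and execute the one predicted to most improve reconstruction. The sketch below is a hypothetical illustration, not the paper's implementation; `score_fn` stands in for the learned data-driven prior, and the pose representation is a toy scalar.

```python
def select_next_touch(candidate_poses, score_fn):
    """Greedy active touch: pick the candidate pose whose predicted
    improvement in reconstruction accuracy is highest.
    `score_fn` is a hypothetical stand-in for a learned prior."""
    return max(candidate_poses, key=score_fn)

def explore(candidate_poses, score_fn, budget):
    """Collect `budget` touches, removing each chosen pose from the pool
    so the same reading is not selected twice."""
    pool = list(candidate_poses)
    chosen = []
    for _ in range(budget):
        best = select_next_touch(pool, score_fn)
        pool.remove(best)
        chosen.append(best)
    return chosen

# Toy example: poses are scalars, and the "learned" scorer prefers
# poses near 0.5.
poses = [0.1, 0.5, 0.9, 0.3]
picked = explore(poses, lambda p: -abs(p - 0.5), budget=2)
# picked → [0.5, 0.3]
```

In practice the scorer would condition on the current partial reconstruction; the greedy structure above is the part the abstract's "actively select the tactile readings" describes.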


3D Shape Reconstruction from Vision and Touch

Neural Information Processing Systems

When a toddler is presented with a new toy, their instinctual behaviour is to pick it up and inspect it with their hands and eyes in tandem, clearly searching over its surface to properly understand what they are playing with. At any instant, touch provides high-fidelity localized information while vision provides complementary global context. However, in 3D shape reconstruction, the complementary fusion of visual and haptic modalities remains largely unexplored. In this paper, we study this problem and present an effective chart-based approach to multi-modal shape understanding which encourages a similar fusion of vision and touch information. To do so, we introduce a dataset of simulated touch and vision signals from the interaction between a robotic hand and a large array of 3D objects. Our results show that (1) leveraging both vision and touch signals consistently improves single-modality baselines; (2) our approach outperforms alternative modality fusion methods and strongly benefits from the proposed chart-based structure; (3) the reconstruction quality increases with the number of grasps provided; and (4) the touch information not only enhances the reconstruction at the touch site but also extrapolates to its local neighborhood.
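The "charts" in this abstract (in the AtlasNet sense) are small learned maps from a 2D parameter square into 3D, whose images tile the surface. A minimal sketch of that idea, with random affine maps standing in for the learned per-chart networks (hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_chart():
    # Each chart is a map from the 2D unit square into 3D space.
    # A random affine map stands in for a learned per-chart MLP.
    A = rng.standard_normal((3, 2))
    b = rng.standard_normal(3)
    return lambda uv: uv @ A.T + b

def sample_surface(charts, pts_per_chart=64):
    """Sample points from every chart and stack them into one point
    cloud, mirroring how chart-based methods assemble a full surface
    from many local patches."""
    uv = rng.random((pts_per_chart, 2))
    return np.concatenate([c(uv) for c in charts], axis=0)

charts = [make_chart() for _ in range(4)]
cloud = sample_surface(charts)   # shape (4 * 64, 3)
```

The appeal for visuotactile fusion is that each touch reading can be assigned its own chart, grounding the surface locally at the contact site while vision-derived charts cover the rest of the object.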


Vi-TacMan: Articulated Object Manipulation via Vision and Touch

Cui, Leiyao, Zhao, Zihang, Xie, Sirui, Zhang, Wenhuan, Han, Zhi, Zhu, Yixin

arXiv.org Artificial Intelligence

Autonomous manipulation of articulated objects remains a fundamental challenge for robots in human environments. Vision-based methods can infer hidden kinematics but can yield imprecise estimates on unfamiliar objects. Tactile approaches achieve robust control through contact feedback but require accurate initialization. This suggests a natural synergy: vision for global guidance, touch for local precision. Yet no framework systematically exploits this complementarity for generalized articulated manipulation. Here we present Vi-TacMan, which uses vision to propose grasps and coarse directions that seed a tactile controller for precise execution. By incorporating surface normals as geometric priors and modeling directions via von Mises-Fisher distributions, our approach achieves significant gains over baselines (all p<0.0001). Critically, manipulation succeeds without explicit kinematic models -- the tactile controller refines coarse visual estimates through real-time contact regulation. Tests on more than 50,000 simulated and diverse real-world objects confirm robust cross-category generalization. This work establishes that coarse visual cues suffice for reliable manipulation when coupled with tactile feedback, offering a scalable paradigm for autonomous systems in unstructured environments.





Review for NeurIPS paper: 3D Shape Reconstruction from Vision and Touch

Neural Information Processing Systems

- I would suggest removing the claim "...touch provides high fidelity localized information while vision provides complementary global context" -- in machine learning we seem to anthropomorphize our algorithms with little evidence. A counterexample to this claim is the case of congenitally blind people, who seem to have no problem describing the global context of things they touch; see "Imagery in the congenitally blind: How visual are visual images?", Zimler and Keenan, 1983.
- The way the paper presents the idea of using charts makes it seem like a novel contribution, but in reality it is built on top of AtlasNet, which also uses the term "chart" to describe its method. In fact, a follow-up paper to AtlasNet [a] generalizes the charts idea even further, which the paper does not cite. I would therefore suggest toning down statements that make this seem like a novel contribution, such as "...which we call charts."


Review for NeurIPS paper: 3D Shape Reconstruction from Vision and Touch

Neural Information Processing Systems

This paper proposes to fuse vision and haptic information to reconstruct 3D shapes for robotic hand manipulation. The reconstruction is done by representing the objects as a collection of deformable meshes (defined as charts in the previously published AtlasNet paper). The merging of the vision and touch charts is done using graph convolutional networks, with local and cross-modality communication between charts. Experiments are conducted in simulation, on a new dataset designed by the authors, with known hand and object surface structure, and vision and touch inputs. After rebuttal, reviewers gave scores between 6 and 7.
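The meta-review summarizes the method as graph convolutions with local and cross-modality communication between charts. A single graph-convolution step over chart features can be sketched as below; this is a generic normalized-adjacency GCN layer offered as a hypothetical illustration of that communication pattern, not the paper's architecture:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: mix each chart's features with its
    neighbors' via the self-loop-augmented, degree-normalized adjacency,
    then apply a weight matrix W and a ReLU."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))     # degree normalization
    return np.maximum(D_inv @ A_hat @ H @ W, 0)  # ReLU nonlinearity

# Toy graph: two vision charts (nodes 0, 1) and one touch chart (node 2),
# fully connected so features can flow across modalities.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
H = np.eye(3)   # toy one-hot per-chart features
W = np.eye(3)   # identity weights, for illustration
H1 = gcn_layer(H, A, W)
```

After one step every chart's features are an average over all three charts, which is the sense in which touch information can inform vision charts and vice versa.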